#0 Data Cleaning Steps and Processing (Not directly related to the creation of graphs)

We should check the integrity of the data involved. I include my code below. It will not be used to create a graph, but it is still useful.

#Load in some relevant libraries.
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.3.3
library(purrr)
## Warning: package 'purrr' was built under R version 4.3.3
library(stringr)
## Warning: package 'stringr' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(htmlwidgets)
## Warning: package 'htmlwidgets' was built under R version 4.3.3
library(forcats)
## Warning: package 'forcats' was built under R version 4.3.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ readr     2.1.5     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(ggplot2)
library(RCurl)
## Warning: package 'RCurl' was built under R version 4.3.3
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete
#Let's load in our main dataset.
w<- read.csv("https://raw.githubusercontent.com/erdenetuya2080/DATA-5310-Data-Visualization/refs/heads/main/vgsales.csv")

The most crucial thing to check is whether the regional sales columns actually sum to the global sales column. Let’s check.

w[abs(w["NA_Sales"]+w["EU_Sales"]+w["JP_Sales"]+w["Other_Sales"]-w["Global_Sales"])>0.02,]
#This code checks whether the regional sales columns add up to the global sales column by flagging rows where the absolute difference is greater than 0.02 (i.e., 20,000 units, since sales are recorded in millions).

Almost all of the data matches. We see a slight discrepancy of around 20,000 units, but this affects only about 4 rows, so it is not major. Our data seems fairly legitimate then, at least on the sales side.
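The same tolerance check can be sketched on toy numbers (a minimal base-R example with made-up values, not the project data):

```r
# Hypothetical example: flag entries where the regional sum and the global
# figure disagree by more than 0.02 (20,000 units when sales are in millions).
regional_sum <- c(1.00, 2.00, 3.05)
global       <- c(1.01, 2.00, 3.00)

which(abs(regional_sum - global) > 0.02)
# 3
```

Using `abs()` matters: it also catches rows where the global figure exceeds the regional sum.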

Let’s now address another crucial point in the data: the Year column. Some entries have N/A as their year. We could either remove those rows or try to impute the values. The following code tries a few imputation methods.

w$NA_Sales<-as.numeric(w$NA_Sales)
w$EU_Sales<-as.numeric(w$EU_Sales)
w$JP_Sales<-as.numeric(w$JP_Sales)

#We convert our sales to numeric just in case
w$Other_Sales<-as.numeric(w$Other_Sales)
w$Global_Sales<-as.numeric(w$Global_Sales)
w$Year<-as.numeric(w$Year)
## Warning: NAs introduced by coercion
#We attempt to convert year to numeric, coercing some columns to be NA.

  
w$ex_year <- as.numeric(str_extract(gsub("[[:punct:]]", "", w$Name), "\\d{4}"))
#Let's try our first technique, which is to extract any 4 consecutive digits from the data (ideally the year).


w <- w %>%
  mutate(Year = ifelse(is.na(Year), ex_year, Year))
#Let's replace all NA in Year with our extracted date.


w$ex_year2 <-as.numeric(str_extract(w$Name, "[012]\\d"))
#Let's see if we can extract two consecutive digits starting with 0, 1, or 2, since a title ending in '07 likely refers to 2007.

w <- w %>%
  mutate(Year = ifelse(is.na(Year), ex_year2+2000, Year))

#We substitute our 20XX extraction into Year where it is missing.

w$ex_year3 <-as.numeric(str_extract(w$Name, "[789]\\d"))

#Let's do the same thing but with 19XX (years from the past century)

w <- w %>%
  mutate(Year = ifelse(is.na(Year), ex_year3+1900, Year))

w$ex_year4 <-str_extract(w$Name, "\\dK\\d?")

#Let's handle titles where the year appears as, e.g., "2K4" (meaning 2004). Note the pattern matches an uppercase K.

w$ex_year4<-as.numeric(gsub("K","00", w$ex_year4))

#Replacing "K" with "00" turns "2K4" into "2004", for example. This assumes people use "K" to stand for a thousand.

w <- w %>%
  mutate(Year = ifelse(is.na(Year), ex_year4, Year))
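As a quick sanity check, the K-replacement heuristic on a few made-up titles (hypothetical examples, using stringr as above):

```r
library(stringr)

titles <- c("NBA 2K4", "Top Golf 2K6", "Plain Game")  # hypothetical titles
raw    <- str_extract(titles, "\\dK\\d?")   # "2K4", "2K6", NA
years  <- as.numeric(gsub("K", "00", raw))  # 2004, 2006, NA
years
```

One caveat: a bare "2K" with no trailing digit would become 200, so spot-checking the extracted values is worthwhile.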

We can use this as our data for the video games now. We will clean the name column when we start joining.

#1 Editing Erdenetuya’s Bar Graph to be interactive and to change the order of the bars

NOTE: The following cell is not originally my code; it is an updated version of Erdenetuya’s code, which I modified to re-sort the columns.

#I rename w to "data" for this visual.
data <- w

# Reshape the Data for Regional Sales

region_sales <- data %>%
  select(Genre, NA_Sales, EU_Sales, JP_Sales, Other_Sales) %>%
  pivot_longer(cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales), 
               names_to = "Region", 
               values_to = "Sales") %>%
  group_by(Genre, Region) %>%
  summarise(Total_Sales = sum(Sales, na.rm = TRUE), .groups = "drop")

#I will begin giving comments here, as the following is my own code.

#We use forcats to adjust the factor levels for Genre and Region. Note that fct_infreq() orders by how often each level occurs; the sales-based ordering of the bars is applied with reorder() in the plot call below.

region_sales <- region_sales %>%
  mutate(Genre = forcats::fct_infreq(Genre), 
         Region = forcats::fct_rev(Region))  

#Note that I use the forcats library to perform the ordering.

region_sales$Genre=as.factor(region_sales$Genre)

#Genre is already a factor after fct_infreq(); we re-assert it here for ggplot.

p<-ggplot(region_sales, aes(x =reorder(Genre,-Total_Sales, sum), y =Total_Sales, fill =forcats::fct_rev(Region))) +
  geom_bar(stat = "identity") +
  #We use stat="identity" since the bar heights come from the data rather than from counts.
  labs(title = "Sales Distribution by Region and Genre",
       x = "Genre",
       y = "Sales (in millions)",
       fill = "Region") + #We add some labels.
  scale_fill_manual(
    values = c("NA_Sales" = "cyan3", "EU_Sales" = "coral", "JP_Sales" = "chartreuse3", "Other_Sales" = "orchid"),
    labels = c("NA_Sales" = "North America", "EU_Sales" = "Europe", "JP_Sales" = "Japan", "Other_Sales" = "Other")
  )+ #We want to explicitly try to match the graphs of our team. So let's match the colors. Additionally, let's get rid of the distasteful underscores by altering our labels.
  theme_minimal() + coord_flip() #Let's flip coordinates since I don't like when we have to change our x-axis angles.
p

Let’s try to make an interactive graph. The following code is mine, designed to create an alternative version of Erdenetuya’s graph above.

#I read the data again just in case.

data <-w


data$Genre<-as.factor(data$Genre)
#Let's convert genre into a categorical variable again.

region_box<-data %>%group_by(Genre) %>%summarise(EU_Sales=sum(EU_Sales),NA_Sales=sum(NA_Sales), JP_Sales=sum(JP_Sales), Other_Sales=sum(Other_Sales) , Global_Sales=sum(Global_Sales))

#Here we aggregate the data, summing the sales in each region by genre.


#Let's now make use of plotly. With plotly, we add each trace separately.


#Let's use reorder() in plotly to order the columns from highest to lowest total sales; we use "-Global_Sales" rather than "Global_Sales" to sort in descending order.
fig <- plot_ly(region_box, x = ~reorder(Genre, -Global_Sales, sum), y = ~NA_Sales, type = 'bar', name = 'NA Sales',
               marker = list(color = 'rgb(0, 205, 205)'))
#We add the NA sales bar, explicitly specifying the color since plotly is having trouble reading "cyan3".
fig <- fig %>% add_trace(y = ~EU_Sales, name = 'EU Sales', marker = list(color = 'coral'))
#We do the same with EU Sales and the rest.
fig <- fig %>% add_trace(y = ~JP_Sales, name = 'JP Sales', marker = list(color = 'chartreuse3'))
fig <- fig %>% add_trace(y = ~Other_Sales, name = 'Other Sales', marker = list(color = 'orchid'))
fig <- fig %>% layout(yaxis = list(title = 'Sales (in millions)'), xaxis=list(title="Genre"), barmode = 'stack')
#We rename the axis titles to be more readable and stack the bars by region.
fig
#Let's plot the graph
saveWidget(fig, file = "genre_sales_bar.html")
#We save our widget so that we can open it later without needing RStudio.

#2 Scatter plot for game scores and sales.

Let’s see how sales and game scores correlate. To do this, we must join four different tables. We have our main table, then we have our ESRB table, our Steam reviews table, and our IGN reviews table. Let’s start joining.

We start with our IGN data table.

#We load the data.
i<-read.csv("https://raw.githubusercontent.com/JesseLoi/Test/refs/heads/main/IGN_data.csv")

#Let's clean the names first.

i$name<-tolower(gsub("[^a-zA-Z0-9]", "", i$game))
#This removes anything that isn't alphanumeric in the name of the game and creates a new column named "name".
i$year<-as.numeric(substr(i$released_date, 1,4))
#Let's extract the first four digits of the release date, which give the year. We won't use this in the final plot, but it is useful for producing a time series of the different games by release year.

Let’s now create a table for Steam games.

#Let's load the data.
s<-read.csv("https://raw.githubusercontent.com/JesseLoi/Test/refs/heads/main/Copy%20of%20Steam%20Trends%202023%20by%20%40evlko%20and%20%40Sadari%20-%20Games%20Data%20(2).csv")


s$cleaned_name <- tolower(gsub("[^a-zA-Z0-9]", "", s$Title))
#We apply the same name-cleaning technique as above to normalize the titles.
s$Reviews.Score.Fancy <- gsub("%", "", s$Reviews.Score.Fancy)
#We want our scores to be numeric, so let's remove the percentage symbol.
s$Reviews.Score.Fancy <- gsub(",", ".", s$Reviews.Score.Fancy)
#Our data appears to use a comma as the decimal separator; let's fix that by replacing the comma with a period.
s$review<-as.numeric(s$Reviews.Score.Fancy)
#We convert our score column to numeric.

We should also remember to clean our main dataset’s names to prepare it for joining.

#We clean the name column as we did in the above cells.
w$cleaned_column <- tolower(gsub("[^a-zA-Z0-9]", "", w$Name))
#We do this to join, removing any formatting particularities.

Let’s start by joining the Steam table with the vgchartz table; we’ll handle the other tables later. To minimize mismatches while still joining generously, we join on the first 20 characters of the title. This lets us ignore edition notation such as “DS edition” in our data. We can, of course, adjust this prefix length to make the join more or less liberal.
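To see how the prefix length controls how liberal the join is, here is a tiny base-R sketch with hypothetical cleaned titles:

```r
cleaned <- c("callofdutyblackops", "callofdutyghosts")  # hypothetical cleaned names

p10 <- substr(cleaned, 1, 10)  # both "callofduty": these rows would join
p16 <- substr(cleaned, 1, 16)  # "callofdutyblacko" vs "callofdutyghosts": no join
p10[1] == p10[2]  # TRUE
p16[1] == p16[2]  # FALSE
```

A shorter prefix merges related franchise entries; a longer one keeps them separate.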

# Let's create a  column with the first 20 characters of the name in both data frames
s <- s %>% mutate(Name_prefix = substr(cleaned_name, 1, 20))
w <- w %>% mutate(Name_prefix = substr(cleaned_column, 1, 20))
#We will join on the shortened names. This also allows us to potentially join similar franchises, if need be. For example, we might join Call of Duty: Black Ops and Call of Duty: Ghosts if we make our Name_prefix short enough.

# We use an inner join to create a table where only matching data is kept. We have no use for rows lacking either a Steam review or a global sales figure.
result <- inner_join(w, s, by = "Name_prefix")
## Warning in inner_join(w, s, by = "Name_prefix"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 18 of `x` matches multiple rows in `y`.
## ℹ Row 4 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
#Our inner join seems to have duplicated our results. Let's remove all duplicates with the distinct() function.
r_unique <- result %>% distinct(Name, .keep_all = TRUE)

steam_table<-r_unique

We see a “many-to-many” relationship warning. This shows that our join ended up matching multiple titles across series. I will work with that result for now, since I want to preserve as much data as possible. For a more conservative join, one can simply use a longer prefix.
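If the many-to-many matches are intended, dplyr (1.1.0 and later) lets us declare the relationship explicitly, which silences the warning. A minimal sketch with toy tibbles (not the project data):

```r
library(dplyr)

x <- tibble(Name_prefix = c("a", "a"), left_id  = 1:2)
y <- tibble(Name_prefix = c("a", "a"), right_id = 3:4)

# Every left "a" matches every right "a": 4 rows, no warning.
inner_join(x, y, by = "Name_prefix", relationship = "many-to-many")
```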

Let’s attempt the SAME technique with the IGN table as well.

# We take a shortened name to join more liberally.
i <- i %>% mutate(Name_prefix = substr(name, 1, 20))
w <- w %>% mutate(Name_prefix = substr(cleaned_column, 1, 20))




# Use inner join to get the most helpful joins.
result <- inner_join(w, i, by = "Name_prefix")
## Warning in inner_join(w, i, by = "Name_prefix"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 25 of `x` matches multiple rows in `y`.
## ℹ Row 7083 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Use distinct() to avoid duplicate rows.
r_unique <- result %>% distinct(Name, .keep_all = TRUE)

ign_table<-r_unique

Let’s join the IGN ratings with the steam ratings and see the scatter plot.

ign_steam <- inner_join(steam_table, ign_table, by = "Name")

There is one small problem with our data: many data points have a Steam review score of zero. This is a statistical anomaly, since an average score of exactly 0 is nearly impossible. So let’s remove them, adding the stipulation that their IGN score be above 5: there may be some truly atrocious games, but none of those would also earn an IGN score greater than 5.

ign_steam_set<-ign_steam %>% filter(review==0 & score>5)
#Let's isolate these statistical anomalies in their own table for inspection.

We can now move on to our visualization process.

names(ign_steam)[names(ign_steam) == "Global_Sales.x"] <- "Global_Sales_Millions"
#Let's rename a column.

ign_steam<-ign_steam %>% filter(review !=0)
#We decide to remove ALL rows with a zero Steam review score.

ign_steam <- ign_steam %>%
  mutate(esrb = if_else(esrb == "", "Not Rated", esrb))
#We change all empty ESRB ratings to Not Rated.

ign_steam <- ign_steam %>%
  mutate(esrb = factor(esrb, levels = c("Everyone", "Everyone 10+", "Teen", "Mature", "Rating Pending", "Not Rated")))
#We change ESRB to a categorical variable and apply an ordering so that it is ordered by age.

p<- ggplot(data=ign_steam, aes( x=review, y=score, size=Global_Sales_Millions, color=esrb, text=Name))
#Let's set up our plot. We also set text=Name so that we can read game names off our scatter plot later.

p+geom_point()+labs(title="Games News Scores and Global Sales", y="IGN Score", x="Steam Review Average", size = "Global Sales in Millions")+ scale_color_manual(
    values = c("green", "blue", "orange", "red", "yellow", "grey")  )+theme_minimal()
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).

#We choose colors that roughly follow the age-rating order.

It’s hard to read each of the dots. Let’s just create an interactive graph.

p<-p+geom_point() +labs(y="IGN Score", x="Steam Average Review") + 
  guides(size="none")+ 
  #We set guides to "none" because plotly was treating our size (sales) variable as categorical in the legend; dropping the size legend sidesteps this.
  scale_color_manual(
    values = c("green", "blue", "orange", "red", "yellow", "grey")
  )+theme_minimal()
#The rest of the settings are the same as above.
ggplotly(p)
z<-ggplotly(p)

saveWidget(z, file = "steam_ign_scatter.html")
#Let's save our interactive plot as an html file.

It turns out, from our inspection of the data, that our joined IGN and Steam table already has an ESRB rating, so we do not need to join an additional table. However, since we do have a separate ESRB dataset, it is worth visualizing in case it contains games not on PC. Let’s repeat the same technique that we’ve used earlier.

# Load the data
q <- read.csv("https://raw.githubusercontent.com/JesseLoi/Test/refs/heads/main/Video_games_esrb_rating.csv")


#Let's trim the ESRB data to only the columns we need.
q<- q %>% select(title, esrb_rating)

#We perform the same cleaning we have from before.
q$cleaned_column <- tolower(gsub("[^a-zA-Z0-9]", "", q$title))

w$cleaned_column <- tolower(gsub("[^a-zA-Z0-9]", "", w$Name))
#We only take the alphanumeric and lowercased version of the name.


# Rename 'title' in q to 'Name' for easier matching
r <- q %>% rename(Name = title)

# We again join on a prefix, but this time a more aggressive (shorter) one. This is because games from the same franchise will usually have similar ESRB ratings.
r <- r %>% mutate(Name_prefix = substr(cleaned_column, 1, 12))
w <- w %>% mutate(Name_prefix = substr(cleaned_column, 1, 12))


# Let's join our tables.
result <- inner_join(w, r, by = "Name_prefix")
## Warning in inner_join(w, r, by = "Name_prefix"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 30 of `x` matches multiple rows in `y`.
## ℹ Row 559 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
r_unique <- result %>% distinct(Name.x, .keep_all = TRUE)
#We remove the many to many relationships to simplify our table.

r_unique$Year <- as.numeric(r_unique$Year)
#Ensure our column is numeric

#We shorten the names so that more labels fit on the plot.
r_unique$short_name <- substr(r_unique$cleaned_column.x, 1, 12) 
p<-ggplot(data=r_unique, aes(x=Year, y=Global_Sales, color=esrb_rating))


p+geom_point() + scale_x_continuous(breaks = seq(1980, 2020, by = 5)) + geom_text_repel(size=3, data=r_unique, mapping = aes(label = short_name), max.overlaps = 5, show.legend = FALSE)
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
## Warning: ggrepel: 727 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#Let's use ggrepel to label points.

We could try to convert this to an interactive plot as above, but let’s avoid that for now, since this plot isn’t used in the final project.

Let’s look at another feature that was overlooked in the final project. Namely, the most popular game in each region.

We examine the distribution of sales across the four regions with a boxplot.

v<-w
v_long <- v %>%
  pivot_longer(cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales), 
               names_to = "Region", 
               values_to = "Sales")
#We reshape our data using pivot_longer. Each game is broken into four rows, one per sales column named in the argument above, with two new columns: Region (which column the value came from) and Sales (the value itself).
#Essentially, pivot_longer converts the column names into values of a new column and moves the sales figures into another, turning an N x 4 block of sales columns into a 4N x 2 block.

# Create boxplots
ggplot(v_long, aes(x = Region, y = Sales)) +  scale_y_log10(limits = c(1, 50))+
  geom_boxplot() +
  labs( x = "Region", y = "Sales in Millions")
## Warning in scale_y_log10(limits = c(1, 50)): log-10 transformation introduced
## infinite values.
## Warning: Removed 64688 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# We log-scale the y-axis since a few titles sold far more than the rest. Additionally, we label the axes.
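The pivot_longer reshape above can be sketched on toy data (one hypothetical game with four sales columns):

```r
library(tidyr)

toy <- data.frame(Name = "GameA", NA_Sales = 1, EU_Sales = 2,
                  JP_Sales = 3, Other_Sales = 4)

# One row per region: (GameA, NA_Sales, 1) ... (GameA, Other_Sales, 4)
pivot_longer(toy, cols = -Name, names_to = "Region", values_to = "Sales")
```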

The visual below displays the region where each game was most popular.

#Goal: for every game, let's take the max sales across all regions and plot it (showing where each game was most popular in the world and to what degree)

v$Max_Sales <- pmax(v$NA_Sales, v$EU_Sales, v$JP_Sales, v$Other_Sales, na.rm = TRUE)

#We create a new column, Max_Sales, storing the maximum value across the region columns listed.

v$Max_Sales_Region <- apply(v[, c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")], 1, function(x) {
  colnames(v)[which.max(x) + 6]
})
#Let's now create another column storing the name of the region that produced the max sales for each game. The anonymous function receives only the four sales columns, so which.max(x) indexes within that subset; we add 6 to map back to the column positions in v, where the sales columns begin at column 7.

v$short_name <- substr(v$Name, 1, 9) 
#We shorten the game names to their first 9 characters using the substr function.

v$Year<-as.numeric(v$Year)
q<-ggplot(data=v, aes(x=Year, y=Max_Sales, color=Max_Sales_Region))

#Create a scatterplot based on year and max_sales (across all regions) and color the points based on the region of max sales.

q+geom_point()+coord_flip() + scale_y_log10(limits = c(4, 50)) +geom_text_repel(size=3, data=v[v$Max_Sales_Region=="EU_Sales"|v$Max_Sales_Region=="JP_Sales"|v$Max_Sales_Region=="Other_Sales"|v$Max_Sales>10,], mapping = aes(label = short_name), max.overlaps = 10, show.legend=FALSE)+ scale_x_continuous(breaks = seq(1980, 2020, by = 5))
## Warning in scale_y_log10(limits = c(4, 50)): log-10 transformation introduced
## infinite values.
## Warning: Removed 16491 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 6460 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#Add labels and a log scale. We exclude games with fewer than 4 million units sold to prevent cluttering the plot.
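The row-wise argmax trick above can be checked on a toy frame that contains only the four sales columns (made-up numbers; no +6 offset is needed here since there are no leading columns):

```r
m <- data.frame(NA_Sales    = c(5, 1),
                EU_Sales    = c(2, 6),
                JP_Sales    = c(1, 2),
                Other_Sales = c(0, 0))

# For each row, return the name of the column holding the maximum value.
apply(m, 1, function(x) colnames(m)[which.max(x)])
# "NA_Sales" "EU_Sales"
```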

Since this is not used in the final project, I decided not to finalize the data. However, there is one way the plot could be made much more readable. As it is, the labels get in the way because too many games are released in the same year, which makes sense. How about we swap the axes?

v<-w

v$Max_Sales <- pmax(v$NA_Sales, v$EU_Sales, v$JP_Sales, v$Other_Sales, na.rm = TRUE)



#We create a new column, Max_Sales, storing the maximum value across the region columns listed.

v$Max_Sales_Region <- apply(v[, c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")], 1, function(x) {
  colnames(v)[which.max(x) + 6]
})




v$Max_Sales_Region <- factor(v$Max_Sales_Region, levels = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))

#As before, we record the region that produced the max sales for each game (the +6 offset maps the subset index back to the sales columns of v). We also set an explicit factor order so the legend reads NA, EU, JP, Other.


v$abrev_name<-gsub("Grand Theft Auto", "GTA", v$Name)
#We see Grand Theft Auto being quite popular, so let's abbreviate it to GTA.

v$abrev_name<-gsub("The", "", v$abrev_name)
#The word "The" can be quite superfluous.

v$abrev_name<-gsub("Call of Duty", "CoD", v$abrev_name)
#We suspect the Call of Duty franchise will take up lots of space, so let's abbreviate it to CoD.

v$abrev_name<-gsub("Super", "", v$abrev_name)
#This is because many games, like Super Mario Bros, can be shortened.
v$abrev_name<-gsub(":.*","", v$abrev_name)
#Remove anything following a colon, since it usually indicates a subtitle or edition, which we are not very interested in.
v$Year<-as.numeric(v$Year)


q<-ggplot(data=v, aes(x=Year, y=Max_Sales, color=Max_Sales_Region))

#Create a scatterplot based on year and max_sales (across all regions) and color the points based on the region of max sales.



q+geom_point(alpha=0.5)+scale_x_continuous(breaks = seq(1980, 2020, by = 5))+#we take every 5 years to avoid clogging
  scale_y_log10(limits = c(4, 45)) +geom_text_repel(size=3, data=subset(v, Max_Sales_Region %in% c("JP_Sales", "EU_Sales") | Max_Sales>10 |Year<1994), mapping = aes(label = abrev_name), max.overlaps = 20, show.legend = FALSE)
## Warning in scale_y_log10(limits = c(4, 45)): log-10 transformation introduced
## infinite values.
## Warning: Removed 16491 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 6569 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).

#We limit our log-scaled y-axis to filter out games with very low sales, as they are not as important. We then use ggrepel to label games that were popular in the EU and JP, since the US dominates, and also label sparsely packed parts of the graph (the early years).

This is much better. The abbreviation step makes the game names more readable and seems to work quite well. If we wanted, we could also convert this into an interactive visual.

Let’s now also plot GDP against European sales. Ting-Yu will make the corresponding graphs for North America and Japan.

# Load dataset
n <- read.csv("https://raw.githubusercontent.com/JesseLoi/Test/refs/heads/main/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_142.csv")

# Check column names to see what we need to clean
colnames(n)
##  [1] "Country.Name"   "Country.Code"   "Indicator.Name" "Indicator.Code"
##  [5] "X1960"          "X1961"          "X1962"          "X1963"         
##  [9] "X1964"          "X1965"          "X1966"          "X1967"         
## [13] "X1968"          "X1969"          "X1970"          "X1971"         
## [17] "X1972"          "X1973"          "X1974"          "X1975"         
## [21] "X1976"          "X1977"          "X1978"          "X1979"         
## [25] "X1980"          "X1981"          "X1982"          "X1983"         
## [29] "X1984"          "X1985"          "X1986"          "X1987"         
## [33] "X1988"          "X1989"          "X1990"          "X1991"         
## [37] "X1992"          "X1993"          "X1994"          "X1995"         
## [41] "X1996"          "X1997"          "X1998"          "X1999"         
## [45] "X2000"          "X2001"          "X2002"          "X2003"         
## [49] "X2004"          "X2005"          "X2006"          "X2007"         
## [53] "X2008"          "X2009"          "X2010"          "X2011"         
## [57] "X2012"          "X2013"          "X2014"          "X2015"         
## [61] "X2016"          "X2017"          "X2018"          "X2019"         
## [65] "X2020"          "X2021"          "X2022"          "X2023"         
## [69] "X"
# We remove the unnecessary columns here
n <- n[, !colnames(n) %in% c("Indicator.Name", "Indicator.Code", "Country.Code", "X")]

# Remove "X" from column names
colnames(n) <- gsub("X", "", colnames(n))
# Reshape the data to long format

re_n <- n %>%
  pivot_longer(
    cols = -c(Country.Name), # Keep these columns as identifiers
    names_to = "year",                     # New column for years
    values_to = "gdp"                      # New column for GDP values
  )

#Let's ensure the columns are numeric
re_n$year<-as.numeric(re_n$year)
re_n$gdp <-as.numeric(re_n$gdp)
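One caveat with the gsub("X", "", ...) used above: it strips every X, not just the leading one. That is harmless for these column names, but anchoring the pattern to the start is the safer habit if any name contained an interior X (toy example):

```r
nm <- c("Country.Name", "X1980", "X1981")

# sub with "^X" only removes an X at the start of each name.
sub("^X", "", nm)
# "Country.Name" "1980" "1981"
```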


g_e<- gapminder %>% filter(continent == "Europe")
#We use gapminder to obtain the names of European countries, which we will use to filter our GDP table. This step extracts only country names; no GDP yet.

re_n= re_n %>% filter(Country.Name %in% g_e$country)

#Let's extract the gdp of European countries, using g_e as a filter.

avg_n<- re_n %>% group_by(year) %>% summarize(avg=mean(gdp, na.rm=TRUE))

#Then let's take the mean GDP across the European countries for each year by grouping on year.
#We will plot this average alongside the GDPs of the individual European countries.


df_e= re_n %>% filter(Country.Name %in% g_e$country)
#This contains the GDPs of the individual European countries (re_n was already filtered above, so this is effectively a copy).



p<-ggplot(data=df_e, aes(x=year, y=gdp, group=Country.Name))

#This plots the lines for the different GDPs of the European countries.

# Filter the gapminder data for Europe after 1980
EGDP <- gapminder %>%
  filter(continent == "Europe" & year > 1980)

# Summarize data to get total EU_Sales per Year
q<-w
#We load in the video game sales data
yearly_sales <- q %>%
  group_by(Year) %>%
  summarize(total_EU_Sales = sum(EU_Sales, na.rm = TRUE)*100)
#We group sales by year, then multiply by 100 so the sales line is on a scale comparable to the GDP lines (sales in millions x 100 = sales in 10,000s).
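A quick unit check on the ×100 scaling (made-up number): sales are recorded in millions of units, so multiplying by 100 re-expresses the same quantity in units of 10,000.

```r
sales_millions <- 3.5           # hypothetical: 3.5 million units
scaled <- sales_millions * 100  # 350 "ten-thousands"

scaled * 10000 == sales_millions * 1e6
# TRUE
```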

p <- ggplot(data = df_e, aes(x = year, y = gdp, group = Country.Name, color = "Country GDP")) + 
  geom_line(show.legend = TRUE, alpha = 0.5) +
  #We first plot our different European countries and their GDPs
  labs(title = "European GDP and Video Game Sales (1980-2015)", y = "GDP per capita", x = "Year") +
  #Let's restrict our data just to what we have in the video game sales.
  xlim(1980, 2015) +theme_minimal()+
  
  # Add average GDP line
  geom_line(data = avg_n, aes(x = year, y = avg,group=NA, color = "Avg GDP (Europe)"), size = 3, show.legend = TRUE) +
  #We increase the size of the average GDP line to make it stand out.
  # Add total EU Sales line
  geom_line(data = yearly_sales, aes(x = Year, y = total_EU_Sales,group=NA, color = "Total Europe Sales in 10,000s"), size = 2, show.legend = TRUE) +
  #We increase the size of the game sales line for it to stand out.
  # Secondary y-axis for our video game sales
  scale_y_continuous(sec.axis = sec_axis(~ . / 1, name = "Total Europe Sales in 10,000s")) +
  
  # Adjust theme for legend
   theme(
    legend.position = c(0.25, 0.8), # We position the legend in the top-left corner
    legend.background = element_rect(fill = "white", color = "black", size = 0.3), # Add a white background with a border
    legend.title = element_blank(), # Remove legend title
    legend.text = element_text(size = 8)    # Adjust legend text size
  ) + scale_color_manual(values = c("Country GDP" = "grey", "Avg GDP (Europe)" = "black", "Total Europe Sales in 10,000s" = "red")) +
  #Add a red and black color to stand out among the country GDP lines
  # Customize legend and axes
  labs(color="Sales & GDP", x = "Year", y = "GDP per capita", linetype = "Type")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p
## Warning: Removed 876 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).